
CoLA  MNLI-m  MNLI-mm  MRPC  QNLI  QQP  RTE  SST-2  STS-B  Avg


1579d5d8edacd85ac1a86aea28bdf32d-Supplemental-Conference.pdf

Neural Information Processing Systems

KD has been extensively applied to computer vision and NLP tasks [52] since its debut. B.1 Knowledge Distillation. Knowledge Distillation (KD) [16] has played the most significant role in overcoming the performance degradation caused by model compression, as the smaller models (i.e., student models) can absorb the rich knowledge of the uncompressed ones (i.e., teacher models) [40, 25, 43, 14]. For the second part, A^S_i (A^T_i) is the attention matrix corresponding to the i-th head (in our setting, h = 12). In the final part, the dimension c of the logit outputs (p^S and p^T) is either 2 or 3 for GLUE tasks. Here we explain in more detail. One-Stage KD means we naively minimize the sum of teacher-student differences on hidden states, attentions, and logits.
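As a rough illustration of the One-Stage KD objective described above, the sketch below sums three teacher-student difference terms (hidden states, attention matrices, logits) in NumPy. The excerpt does not give the exact form of each term, so the choices here are assumptions: mean squared error for hidden states and attentions, soft cross-entropy for logits, equal weighting, and a student with the same number of layers and the same shapes as the teacher. The function names and tensor sizes are hypothetical; only h = 12 heads and c = 3 classes come from the text.

```python
import numpy as np

def mse(a, b):
    """Mean squared difference between two same-shaped arrays."""
    return float(np.mean((a - b) ** 2))

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def soft_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    p_teacher = np.exp(log_softmax(teacher_logits))
    return float(-(p_teacher * log_softmax(student_logits)).sum(axis=-1).mean())

def one_stage_kd_loss(s_hidden, t_hidden, s_attn, t_attn, s_logits, t_logits):
    """Sum of the three difference terms (assumed forms, equal weights)."""
    hidden_term = sum(mse(s, t) for s, t in zip(s_hidden, t_hidden))
    attn_term = sum(mse(s, t) for s, t in zip(s_attn, t_attn))
    logit_term = soft_cross_entropy(s_logits, t_logits)
    return hidden_term + attn_term + logit_term

# Toy data: h = 12 heads and c = 3 classes follow the text; other sizes are arbitrary.
rng = np.random.default_rng(0)
layers, batch, heads, seq, dim, c = 2, 4, 12, 8, 16, 3
t_hidden = [rng.normal(size=(batch, seq, dim)) for _ in range(layers)]
t_attn = [rng.normal(size=(batch, heads, seq, seq)) for _ in range(layers)]
t_logits = rng.normal(size=(batch, c))

# A hypothetical student whose outputs are noisy copies of the teacher's.
s_hidden = [x + 0.1 * rng.normal(size=x.shape) for x in t_hidden]
s_attn = [x + 0.1 * rng.normal(size=x.shape) for x in t_attn]
s_logits = t_logits + 0.1 * rng.normal(size=t_logits.shape)

loss = one_stage_kd_loss(s_hidden, t_hidden, s_attn, t_attn, s_logits, t_logits)
print(loss)
```

With these assumed terms, the loss is smallest when the student reproduces the teacher exactly: the two MSE terms vanish and the logit term reduces to the entropy of the teacher's distribution.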


XTC: Extreme Compression for Pre-trained Transformers Made Simple and Efficient

Neural Information Processing Systems

As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression, named XTC.